This coursework focuses on housing prices, with the main objective being to predict the price of a property based on various inputs. The inputs include features such as the area, the number and types of rooms, and additional factors like the availability of a main road, hot water heating, and more.
The dependent variable is the price, as it is the primary concern for most people searching for a house. The goal of this work is to predict the price based on diverse inputs, which consist of mixed data types, such as:
This project addresses a regression problem because the objective is to predict a numeric value—in this case, the price of the property.
Now we are going to import our dataset into this project.
dt_houses <- fread(file = "./Datasets/Regression_set.csv")
I would like to check, if i have some nullish data in my
dataset. I think it is a good idea to go through all rows and colums and
check, if there is a NA. I want to check it with built-in function in R
complete.cases(data_table). This function returns TRUE or FALSE
if row contains a NA value.
nas <- dt_houses[!complete.cases(dt_houses)]
nas
That looks great, now we can explore our dataset :)
Before we will explore our data, I want to import all libraries, which we will probably use:
library(data.table)
library(ggcorrplot)
library(ggExtra)
library(ggplot2)
library(ggridges)
library(ggsci)
library(ggthemes)
library(RColorBrewer)
library(svglite)
library(viridis)
library(scales)
library(rpart)
library(rpart.plot)
I found some helpful functions in R, so we could have a look on our data. We will start with a structure, than we will get some statistic data and take a head() of the data
str(dt_houses)
Classes ‘data.table’ and 'data.frame': 545 obs. of 13 variables:
$ price : int 13300000 12250000 12250000 12215000 11410000 10850000 10150000 10150000 9870000 9800000 ...
$ area : int 7420 8960 9960 7500 7420 7500 8580 16200 8100 5750 ...
$ bedrooms : int 4 4 3 4 4 3 4 5 4 3 ...
$ bathrooms : int 2 4 2 2 1 3 3 3 1 2 ...
$ stories : int 3 4 2 2 2 1 4 2 2 4 ...
$ mainroad : chr "yes" "yes" "yes" "yes" ...
$ guestroom : chr "no" "no" "no" "no" ...
$ basement : chr "no" "no" "yes" "yes" ...
$ hotwaterheating : chr "no" "no" "no" "no" ...
$ airconditioning : chr "yes" "yes" "no" "yes" ...
$ parking : int 2 3 2 3 2 2 2 0 2 1 ...
$ prefarea : chr "yes" "no" "yes" "yes" ...
$ furnishingstatus: chr "furnished" "furnished" "semi-furnished" "furnished" ...
- attr(*, ".internal.selfref")=<externalptr>
Statistic data:
summary(dt_houses[, .(price, area, bedrooms, bathrooms, stories, parking)])
price area bedrooms bathrooms stories parking
Min. : 1750000 Min. : 1650 Min. :1.000 Min. :1.000 Min. :1.000 Min. :0.0000
1st Qu.: 3430000 1st Qu.: 3600 1st Qu.:2.000 1st Qu.:1.000 1st Qu.:1.000 1st Qu.:0.0000
Median : 4340000 Median : 4600 Median :3.000 Median :1.000 Median :2.000 Median :0.0000
Mean : 4766729 Mean : 5151 Mean :2.965 Mean :1.286 Mean :1.806 Mean :0.6936
3rd Qu.: 5740000 3rd Qu.: 6360 3rd Qu.:3.000 3rd Qu.:2.000 3rd Qu.:2.000 3rd Qu.:1.0000
Max. :13300000 Max. :16200 Max. :6.000 Max. :4.000 Max. :4.000 Max. :3.0000
and this is a sample of our dataset:
head(dt_houses)
I would like to start from density of a main values, which are from my domain knowledge are important in price of the properties
We will start with price density:
ggplot(data = dt_houses, aes(x = price)) +
geom_density(fill="#f1b147", color="#f1b147", alpha=0.25) +
labs(
x = 'Price',
y = 'Density'
) +
geom_vline(xintercept = mean(dt_houses$price), linetype="dashed") +
scale_x_continuous(labels = label_number(scale = 1e-6, suffix = "M")) +
theme_minimal() +
theme(axis.line = element_line(color = "#000000"))
Also it would be greate to have a look at area density:
ggplot(data = dt_houses, aes(x = area)) +
geom_density(fill="#f1b147", color="#f1b147", alpha=0.25) +
labs(
x = 'Price',
y = 'Density'
) +
theme_minimal() +
theme(axis.line = element_line(color = "#000000"))
This is interesting, how does area affect price of the house. We
will plot it with points, where price is on the y-axis and area on
x-axis.
ggplot() +
geom_point(data = dt_houses, aes(x = area, y = price, color = parking)) +
scale_y_continuous(labels = label_number(scale = 1e-6, suffix = "M")) +
theme_minimal() +
theme(axis.line = element_line(color = "#000000"))
This looks nice, and it is also logical, more space, higher price.
But, now I have the simplest idea, how does amount of bedrooms correlates with the price.
ggplot(data = dt_houses, aes(x = factor(bedrooms), y = price)) +
geom_boxplot() +
theme_minimal()
We can see, that on average, more bedrooms, means higher price, but I think there is not really strong relationship between this two variables.
Also it would be great to take a look at a bedrooms histogram:
ggplot(data = dt_houses, aes(x = bedrooms)) +
geom_histogram(fill="#2f9e44", color="#2f9e44", alpha=0.25) +
geom_vline(xintercept = mean(dt_houses$bedrooms), linetype="dashed") +
theme_minimal() +
theme(axis.line = element_line(color = "#000000"))
Also I want to show you the mean of the bedrooms:
mean(dt_houses$bedrooms)
[1] 2.965138
Here we can see, that the most of the properties tend to have 2, 3 or 4 rooms.
Let’s also have a look at histogram of stroies:
ggplot(data = dt_houses, aes(x = stories)) +
geom_histogram(fill="#2f9e44", color="#2f9e44", alpha=0.25) +
geom_vline(xintercept = mean(dt_houses$stories), linetype="dashed") +
theme_minimal() +
theme(axis.line = element_line(color = "#000000"))
mean(dt_houses$stories)
[1] 1.805505
It is interesting how much real estate furnished or not
ggplot(data = dt_houses, aes(x = factor(furnishingstatus), fill = factor(furnishingstatus))) +
geom_bar(color="#ced4da", alpha=0.25) +
scale_fill_viridis_d(option = "D") +
labs(title = "Bar Chart with Different Colors",
x = "Furnishing Status",
y = "Count") +
theme_minimal() +
theme(axis.line = element_line(color = "#000000"))
We can see, that most of the houses are semi-furnished.
Now, it would be great, to look at price and area distribution in differently furnished properties
ggplot(data = dt_houses, aes(y = price, x = area)) +
geom_point(data = dt_houses, aes(y = price, x = area, color = bedrooms)) +
geom_hline(yintercept = mean(dt_houses$price), linetype='dashed') +
facet_grid(.~furnishingstatus) +
scale_y_continuous(labels = label_number(scale = 1e-6, suffix = "M")) +
scale_color_distiller(type = "seq", palette = "Greens") +
theme_minimal() +
theme(axis.line = element_line(color = "#000000"))
Also, on average, you can notice, that unfurnished houses, are less expencive.
We can also take a look on some pie charts:
dt_mainroad_counts <- as.data.frame(table(dt_houses$mainroad)) #table() - creates frequency table
colnames(dt_mainroad_counts) <- c("mainroad_status", "count")
dt_mainroad_counts$percentage <- round(dt_mainroad_counts$count / sum(dt_mainroad_counts$count) * 100, 1)
ggplot(data = dt_mainroad_counts, aes(x = "", y = count, fill = mainroad_status)) +
geom_bar(stat = "identity", width = 1, color = "white") +
coord_polar("y", start = 0) +
geom_text(aes(label = paste0(percentage, "%")),
position = position_stack(vjust = 0.5), color = "white", size = 4) +
theme_void() +
scale_fill_manual(values = c("#F1B147", "#47B1F1")) +
labs(
title = "Distribution of Mainroad Status",
fill = "Mainroad Status"
)
I think that would be enough exploration and we can start with our first model.
First, I would like to start pretty simple with linear model.
I consider to take this variables in my model: area, bedrooms, bathrooms, hotwaterheating, airconditioning, stories, mainroad, parking and furnishingstatus.
I will use lm function in R to find needed beta coefficients and create my model
price_lm <- lm(formula = price ~ area + bedrooms + hotwaterheating + airconditioning + stories + mainroad + parking + furnishingstatus + bathrooms + guestroom + basement + prefarea, data = dt_houses)
summary(price_lm)
Call:
lm(formula = price ~ area + bedrooms + hotwaterheating + airconditioning +
stories + mainroad + parking + furnishingstatus + bathrooms +
guestroom + basement + prefarea, data = dt_houses)
Residuals:
Min 1Q Median 3Q Max
-2619718 -657322 -68409 507176 5166695
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 42771.69 264313.31 0.162 0.871508
area 244.14 24.29 10.052 < 2e-16 ***
bedrooms 114787.56 72598.66 1.581 0.114445
hotwaterheatingyes 855447.15 223152.69 3.833 0.000141 ***
airconditioningyes 864958.31 108354.51 7.983 8.91e-15 ***
stories 450848.00 64168.93 7.026 6.55e-12 ***
mainroadyes 421272.59 142224.13 2.962 0.003193 **
parking 277107.10 58525.89 4.735 2.82e-06 ***
furnishingstatussemi-furnished -46344.62 116574.09 -0.398 0.691118
furnishingstatusunfurnished -411234.39 126210.56 -3.258 0.001192 **
bathrooms 987668.11 103361.98 9.555 < 2e-16 ***
guestroomyes 300525.86 131710.22 2.282 0.022901 *
basementyes 350106.90 110284.06 3.175 0.001587 **
prefareayes 651543.80 115682.34 5.632 2.89e-08 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 1068000 on 531 degrees of freedom
Multiple R-squared: 0.6818, Adjusted R-squared: 0.674
F-statistic: 87.52 on 13 and 531 DF, p-value: < 2.2e-16
We got 0.64 R-squared, which is not that bad for a model just made up. But that’s not all, I will try to do better here, but first, another model.
But I would like to measure performance of my models with MSE, so I will calculate MSE for linear model.
price_lm_mse <- mean(price_lm$residuals^2)
price_lm_mse
[1] 1.111188e+12
I think this model could perform better, because there some variables which can affect this model not only linearly, but the other way, in this case tree model can show better performance
prices_tree <- rpart(data = dt_houses, formula = price ~ area + bedrooms + hotwaterheating + airconditioning + stories + mainroad + parking + furnishingstatus + bathrooms + guestroom + basement + prefarea, method = 'anova')
prp(prices_tree, digits = -3)
printcp(prices_tree)
Regression tree:
rpart(formula = price ~ area + bedrooms + hotwaterheating + airconditioning +
stories + mainroad + parking + furnishingstatus + bathrooms +
guestroom + basement + prefarea, data = dt_houses, method = "anova")
Variables actually used in tree construction:
[1] airconditioning area basement bathrooms furnishingstatus parking
Root node error: 1.9032e+15/545 = 3.4921e+12
n= 545
CP nsplit rel error xerror xstd
1 0.304946 0 1.00000 1.00216 0.084839
2 0.094553 1 0.69505 0.72713 0.062856
3 0.053743 2 0.60050 0.63275 0.055367
4 0.026381 3 0.54676 0.59669 0.051745
5 0.024922 4 0.52038 0.61131 0.052238
6 0.022993 5 0.49546 0.60175 0.049671
7 0.021374 6 0.47246 0.59794 0.050744
8 0.015261 7 0.45109 0.58060 0.050909
9 0.013952 8 0.43583 0.56299 0.050087
10 0.012386 9 0.42188 0.55418 0.049659
11 0.010000 10 0.40949 0.55545 0.050225
prices_tree
n= 545
node), split, n, deviance, yval
* denotes terminal node
1) root 545 1.903208e+15 4766729
2) area< 5954 361 6.066751e+14 4029993
4) bathrooms< 1.5 293 3.297298e+14 3773561
8) area< 4016 174 1.437122e+14 3431227
16) furnishingstatus=unfurnished 78 4.036605e+13 2977962 *
17) furnishingstatus=furnished,semi-furnished 96 7.430067e+13 3799505 *
9) area>=4016 119 1.358098e+14 4274118 *
5) bathrooms>=1.5 68 1.746610e+14 5134912
10) airconditioning=no 44 7.024826e+13 4563682 *
11) airconditioning=yes 24 6.373358e+13 6182167 *
3) area>=5954 184 7.161564e+14 6212174
6) bathrooms< 1.5 108 2.869179e+14 5382579
12) airconditioning=no 65 1.170629e+14 4843569
24) basement=no 38 5.226335e+13 4304816 *
25) basement=yes 27 3.824662e+13 5601815 *
13) airconditioning=yes 43 1.224240e+14 6197360 *
7) bathrooms>=1.5 76 2.492851e+14 7391072
14) parking< 1.5 51 7.184700e+13 6859794 *
15) parking>=1.5 25 1.336772e+14 8474878
30) airconditioning=no 10 5.146311e+13 7285600 *
31) airconditioning=yes 15 5.864106e+13 9267729 *
plotcp(prices_tree)
prices_tree_min_cp <- prices_tree$cptable[which.min(prices_tree$cptable[, "xerror"]), "CP"]
model_tree <- prune(prices_tree, cp = prices_tree_min_cp )
prp(prices_tree,digits = -3)
after we pruned the tree, let’s calculate the MSE for the tree model
prices_tree_pred <- predict(prices_tree, dt_houses[, c("area","bathrooms", "bedrooms", "hotwaterheating", "airconditioning", "parking", "stories", "mainroad", "furnishingstatus", "guestroom", "basement", "prefarea")])
prices_tree_mse <- mean((dt_houses$price - prices_tree_pred)^2)
prices_tree_mse
[1] 1.429988e+12
price linear model has a MSE of
price_lm_mse
[1] 1.111188e+12
price tree model has a MSE of
prices_tree_mse
[1] 1.429988e+12
It is surprising for me, as for a person who does not have a lot of experience in modelling, that linear model performs better than tree model by approx. 22%.
100 - price_lm_mse / prices_tree_mse * 100
[1] 22.29392
Here I would like to try all ideas and observations, which I’ve had through my course work. I’ve seen two columns, such as “bedrooms” and “bathrooms”, they store numerical value, amount of this kind of rooms. It makes sense for me to create a new column “room_count”, because it may have bigger impact on the performance.
dt_houses[, 'room_count' := bathrooms + bedrooms]
Let’s try Model with a new variable
price_lm_2 <- lm(formula = price ~ area + bedrooms + hotwaterheating + airconditioning + stories + mainroad + parking + furnishingstatus + bathrooms + guestroom + basement + prefarea + room_count, data = dt_houses)
summary(price_lm_2)
Call:
lm(formula = price ~ area + bedrooms + hotwaterheating + airconditioning +
stories + mainroad + parking + furnishingstatus + bathrooms +
guestroom + basement + prefarea + room_count, data = dt_houses)
Residuals:
Min 1Q Median 3Q Max
-2619718 -657322 -68409 507176 5166695
Coefficients: (1 not defined because of singularities)
Estimate Std. Error t value Pr(>|t|)
(Intercept) 42771.69 264313.31 0.162 0.871508
area 244.14 24.29 10.052 < 2e-16 ***
bedrooms 114787.56 72598.66 1.581 0.114445
hotwaterheatingyes 855447.15 223152.69 3.833 0.000141 ***
airconditioningyes 864958.31 108354.51 7.983 8.91e-15 ***
stories 450848.00 64168.93 7.026 6.55e-12 ***
mainroadyes 421272.59 142224.13 2.962 0.003193 **
parking 277107.10 58525.89 4.735 2.82e-06 ***
furnishingstatussemi-furnished -46344.62 116574.09 -0.398 0.691118
furnishingstatusunfurnished -411234.39 126210.56 -3.258 0.001192 **
bathrooms 987668.11 103361.98 9.555 < 2e-16 ***
guestroomyes 300525.86 131710.22 2.282 0.022901 *
basementyes 350106.90 110284.06 3.175 0.001587 **
prefareayes 651543.80 115682.34 5.632 2.89e-08 ***
room_count NA NA NA NA
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 1068000 on 531 degrees of freedom
Multiple R-squared: 0.6818, Adjusted R-squared: 0.674
F-statistic: 87.52 on 13 and 531 DF, p-value: < 2.2e-16
mean(price_lm_2$residuals^2)
[1] 1.111188e+12
this is absolutely the same.
what if we will try to bring the area variable closer to Gaussian with log transformation
dt_houses[, area_log := log(area)]
little visualisation:
ggplot(data = dt_houses, aes(x = area_log)) +
geom_density(fill="#f1b147", color="#f1b147", alpha=0.25) +
labs(
x = 'Price',
y = 'Density'
) +
theme_minimal() +
theme(axis.line = element_line(color = "#000000"))
and try model again :)
price_lm_2 <- lm(formula = price ~ area + bedrooms + hotwaterheating + airconditioning + stories + mainroad + parking + furnishingstatus + bathrooms + guestroom + basement + prefarea + area_log, data = dt_houses)
summary(price_lm_2)
Call:
lm(formula = price ~ area + bedrooms + hotwaterheating + airconditioning +
stories + mainroad + parking + furnishingstatus + bathrooms +
guestroom + basement + prefarea + area_log, data = dt_houses)
Residuals:
Min 1Q Median 3Q Max
-2607115 -665756 -73006 497325 5120891
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -8.716e+06 3.455e+06 -2.523 0.011936 *
area 4.404e+01 8.233e+01 0.535 0.592912
bedrooms 1.175e+05 7.224e+04 1.627 0.104283
hotwaterheatingyes 8.585e+05 2.220e+05 3.867 0.000124 ***
airconditioningyes 8.214e+05 1.092e+05 7.525 2.28e-13 ***
stories 4.475e+05 6.386e+04 7.007 7.41e-12 ***
mainroadyes 3.471e+05 1.445e+05 2.403 0.016608 *
parking 2.689e+05 5.832e+04 4.612 5.01e-06 ***
furnishingstatussemi-furnished -7.058e+04 1.164e+05 -0.607 0.544418
furnishingstatusunfurnished -4.288e+05 1.258e+05 -3.410 0.000699 ***
bathrooms 9.814e+05 1.029e+05 9.540 < 2e-16 ***
guestroomyes 2.419e+05 1.331e+05 1.818 0.069629 .
basementyes 3.678e+05 1.099e+05 3.345 0.000880 ***
prefareayes 6.727e+05 1.154e+05 5.830 9.66e-09 ***
area_log 1.169e+06 4.596e+05 2.542 0.011290 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 1062000 on 530 degrees of freedom
Multiple R-squared: 0.6856, Adjusted R-squared: 0.6773
F-statistic: 82.57 on 14 and 530 DF, p-value: < 2.2e-16
mean(price_lm_2$residuals^2)
[1] 1.097798e+12
it performs approx 0.4% better, if we look at R-Squared error.
dt_houses_2 <- dt_houses
dt_houses_2[, bathrooms_factor := factor(bathrooms)]
price_lm_2 <- lm(formula = price ~ area + bedrooms + hotwaterheating + airconditioning + stories + mainroad + parking + furnishingstatus + guestroom + basement + prefarea + bathrooms_factor, data = dt_houses_2)
summary(price_lm_2)
Call:
lm(formula = price ~ area + bedrooms + hotwaterheating + airconditioning +
stories + mainroad + parking + furnishingstatus + guestroom +
basement + prefarea + bathrooms_factor, data = dt_houses_2)
Residuals:
Min 1Q Median 3Q Max
-2627763 -656980 -65726 518665 5227119
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1029984.6 265787.7 3.875 0.000120 ***
area 242.5 24.3 9.980 < 2e-16 ***
bedrooms 119341.5 72605.1 1.644 0.100831
hotwaterheatingyes 868831.3 223056.0 3.895 0.000111 ***
airconditioningyes 872141.5 108830.1 8.014 7.15e-15 ***
stories 449954.7 64397.6 6.987 8.47e-12 ***
mainroadyes 426087.9 142121.4 2.998 0.002845 **
parking 269842.9 58629.5 4.603 5.23e-06 ***
furnishingstatussemi-furnished -41952.7 116924.0 -0.359 0.719886
furnishingstatusunfurnished -411835.0 126567.3 -3.254 0.001211 **
guestroomyes 308341.9 131668.8 2.342 0.019561 *
basementyes 355061.3 110211.0 3.222 0.001353 **
prefareayes 661344.4 115716.9 5.715 1.83e-08 ***
bathrooms_factor2 916763.3 119959.5 7.642 1.01e-13 ***
bathrooms_factor3 2081687.7 352189.8 5.911 6.11e-09 ***
bathrooms_factor4 4661918.8 1089300.6 4.280 2.22e-05 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 1067000 on 529 degrees of freedom
Multiple R-squared: 0.6837, Adjusted R-squared: 0.6747
F-statistic: 76.21 on 15 and 529 DF, p-value: < 2.2e-16
mean(price_lm_2$residuals^2)
[1] 1.104724e+12
Engineer a minimum of two new features based on your data exploration or on theoretical considerations. Add these features to your models and reevaluate their performance on the same performance metrics as before.